 Assessment & Standards


CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension

Neural Information Processing Systems

Current Large Language Models (LLMs) are confronted with overwhelming information volume when comprehending long-form documents. This challenge raises the imperative of a cohesive memory module, which can elevate vanilla LLMs into autonomous reading agents. Despite the emergence of some heuristic approaches, a systematic design principle remains absent. To fill this void, we draw inspiration from Jean Piaget's Constructivist Theory, illuminating three traits of the agentic memory--structured schemata, flexible assimilation, and dynamic accommodation.


SKETCHMIND: A Multi-Agent Cognitive Framework for Assessing Student-Drawn Scientific Sketches

Neural Information Processing Systems

Scientific sketches (e.g., models) offer a powerful lens into students' conceptual understanding, yet AI-powered automated assessment of such free-form, visually diverse artifacts remains a critical challenge. Existing solutions often either treat sketch evaluation as an image classification task or rely on monolithic vision-language models, and so lack interpretability, pedagogical alignment, and adaptability across cognitive levels. To address these limitations, we present SKETCHMIND, a cognitively grounded, multi-agent framework for evaluating and improving student-drawn scientific sketches. SKETCHMIND introduces Sketch Reasoning Graphs (SRGs), semantic graph representations that embed domain concepts and Bloom's taxonomy-based cognitive labels. The system comprises modular agents responsible for rubric parsing, sketch perception, cognitive alignment, and iterative feedback with sketch modification, enabling personalized and transparent evaluation. We evaluate SKETCHMIND on a curated dataset of 3,575 student-generated sketches across six science assessment items, which differ in their highest Bloom's level and require students to draw models to explain phenomena. Compared with the baseline GPT-4o performance without SRG (average accuracy: 55.6%), integrating SRG achieves 77.1% average accuracy (+21.4% average absolute gain).


Matchings Under Biased and Correlated Evaluations

Neural Information Processing Systems

We study a two-institution stable matching model in which candidates from two distinct groups are evaluated using partially correlated signals that are group-biased. This extends prior work (which assumes institutions evaluate candidates in an identical manner) to a more realistic setting in which institutions rely on overlapping, but independently processed, criteria. These evaluations could consist of a variety of informative tools such as standardized tests, shared recommendation systems, or AI-based assessments with local noise. Two key parameters govern evaluations: the bias parameter β ∈ (0,1], which models systematic disadvantage faced by one group, and the correlation parameter γ ∈ [0,1], which captures the alignment between institutional rankings. We study the representation ratio R(β,γ), i.e., the ratio of disadvantaged to advantaged candidates selected by the matching process in this setting.


Multi-Agent Debate for LLM Judges with Adaptive Stability Detection

Neural Information Processing Systems

As the reasoning capabilities of Large Language Models (LLMs) advance, they are increasingly employed for complex evaluation tasks, such as grading student responses, verifying factual claims, and comparing competing answers. Leveraging multiple LLMs as automated judges can enhance robustness and accuracy by aggregating diverse perspectives, yet existing approaches often rely on static and simple aggregation methods, such as majority voting, which may produce incorrect judgments despite correct individual assessments. We propose a novel multi-agent debate framework where LLMs collaboratively reason and iteratively refine judgments, formalizing this process mathematically and proving its advantages over static ensembles. To ensure computational efficiency, we introduce a stability detection mechanism using a time-varying Beta-Binomial mixture model (a mixture of two Beta-Binomial distributions) that tracks judge consensus dynamics and applies adaptive stopping via Kolmogorov-Smirnov testing. Experiments across diverse benchmarks and models demonstrate significant improvements in judgment accuracy over majority voting while maintaining computational efficiency.
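
The adaptive-stopping idea can be illustrated with a minimal sketch. It is a simplified stand-in and not the authors' method: it skips the Beta-Binomial mixture fit entirely and compares the raw per-item vote counts of consecutive debate rounds with a two-sample Kolmogorov-Smirnov statistic. The function names, the tolerance `tol`, and the `patience` rule are all hypothetical choices made for illustration.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


def stopping_round(rounds, tol=0.05, patience=2):
    """Return the index of the first round at which consensus looks stable.

    rounds[t][i] is the number of judges voting "correct" on item i in round t.
    Hypothetical rule (an assumption, not taken from the paper): stop once the
    KS statistic between consecutive rounds stays at or below `tol` for
    `patience` consecutive comparisons.
    """
    calm = 0
    for t in range(1, len(rounds)):
        if ks_statistic(rounds[t - 1], rounds[t]) <= tol:
            calm += 1
        else:
            calm = 0
        if calm >= patience:
            return t
    return len(rounds) - 1  # never stabilized, so every round is used


# Five judges, four items: votes polarize and then stop changing.
rounds = [
    [2, 3, 1, 4],
    [1, 4, 0, 5],
    [0, 5, 0, 5],
    [0, 5, 0, 5],
    [0, 5, 0, 5],
]
print(stopping_round(rounds))
```

In this toy run the vote distribution stops moving after the third round, so the rule halts once two consecutive comparisons show no change. A fuller version would fit the mixture model to each round and test the fitted distributions instead of the raw counts.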


Personalized Exercise Recommendation with Semantically-Grounded Knowledge Tracing

Neural Information Processing Systems

We introduce ExRec, a general framework for personalized exercise recommendation with semantically-grounded knowledge tracing. Our method builds on the observation that existing exercise recommendation approaches simulate student performance via knowledge tracing (KT) but often overlook two key aspects: (a) the semantic content of questions and (b) the sequential, structured progression of student learning. To address this, ExRec presents an end-to-end pipeline, from annotating the knowledge concepts (KCs) of questions and learning their semantic representations to training KT models and optimizing several reinforcement learning (RL) methods. Moreover, we improve standard Q-learning-based continuous RL methods via a tailored model-based value estimation (MVE) approach that directly leverages the components of the KT model in estimating cumulative knowledge improvement.


Causal Algorithmic Recourse: Foundations and Methods

arXiv.org Machine Learning

The trustworthiness of AI decision-making systems is increasingly important. A key feature of such systems is the ability to provide recommendations for how an individual may reverse a negative decision, a problem known as algorithmic recourse. Existing approaches treat recourse outcomes as counterfactuals of a fixed unit, ignoring that real-world recourse involves repeated decisions on the same individual under possibly different latent conditions. We develop a causal framework that models recourse as a process over pre- and post-intervention outcomes, allowing for partial stability and resampling of latent variables. We introduce post-recourse stability conditions that enable reasoning about recourse from observational data alone, and develop a copula-based algorithm for inferring the effects of recourse under these conditions. For settings where paired observations of the same individual before and after intervention are available (called recourse data), we develop methods for inferring copula parameters and performing goodness-of-fit testing. When the copula model is rejected, we provide a distribution-free algorithm for learning recourse effects directly from recourse data. We demonstrate the value of the proposed methods on real and semi-synthetic datasets.


ChatGPT trounces humans in entrance exams for top Japan university, study finds

The Japan Times

AI models surpassed the highest score recorded for a human test taker in this year's University of Tokyo entrance exam, a new study shows. If an artificial intelligence model such as ChatGPT had taken the entrance exams for Japan's top university in 2026, it would have been assessed as top of the class and admitted for scoring higher than any human test takers, a study by AI startup LifePrompt has found. The research used three major AI models -- ChatGPT 5.2 Thinking by OpenAI, Gemini 3 Pro Preview by Google and Claude Opus 4.5 by Anthropic -- and had them take the actual entrance exam used by the University of Tokyo in February 2026 to assess candidates for courses set to start in April. The university's category 3 science exam, often taken by those who want to enter the institution's medical school, is considered the most difficult exam to pass in Japan.



An Autoencoder-Like Nonnegative Matrix Co-Factorization for Improved Student Cognitive Modeling

Neural Information Processing Systems

Student cognitive modeling (SCM) is a fundamental task in intelligent education, with applications ranging from personalized learning to educational resource allocation. By exploiting students' response logs, SCM aims to predict their exercise performance as well as estimate knowledge proficiency in a subject. Data mining approaches such as matrix factorization can obtain high accuracy in predicting student performance on exercises, but the knowledge proficiency is unknown or poorly estimated. The situation is further exacerbated if only sparse interactions exist between exercises and students (or knowledge concepts). To solve this dilemma, we root monotonicity (a fundamental psychometric theory on educational assessments) in a co-factorization framework and present an autoencoder-like nonnegative matrix co-factorization (AE-NMCF), which improves the accuracy of estimating the student's knowledge proficiency via an encoder-decoder learning pipeline. The resulting estimation problem is nonconvex with nonnegative constraints. We introduce a projected gradient method based on block coordinate descent with Lipschitz constants and guarantee the method's theoretical convergence. Experiments on several real-world data sets demonstrate the efficacy of our approach in terms of both performance prediction accuracy and knowledge estimation ability, when compared with existing student cognitive models.
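
The optimization idea named in this abstract, projected gradient steps over alternating blocks with step sizes set by Lipschitz constants, can be sketched on plain nonnegative matrix factorization. This is not AE-NMCF itself: the co-factorization, the encoder-decoder structure, and the monotonicity constraint are all omitted, and the function name, rank, and iteration count are assumptions made for illustration.

```python
import numpy as np


def nmf_projected_gradient(X, rank, iters=200, seed=0):
    """Alternating projected gradient for min 0.5 * ||X - W H||_F^2 with W, H >= 0."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], rank))
    H = rng.random((rank, X.shape[1]))
    losses = []
    for _ in range(iters):
        # Block 1: update W with H fixed. The gradient in W is Lipschitz with
        # constant ||H H^T||_2, so a step of size 1/L cannot increase the loss.
        L_w = np.linalg.norm(H @ H.T, 2) + 1e-12
        W = np.maximum(0.0, W - ((W @ H - X) @ H.T) / L_w)
        # Block 2: update H with W fixed, Lipschitz constant ||W^T W||_2.
        L_h = np.linalg.norm(W.T @ W, 2) + 1e-12
        H = np.maximum(0.0, H - (W.T @ (W @ H - X)) / L_h)
        losses.append(0.5 * np.linalg.norm(X - W @ H) ** 2)
    return W, H, losses


# Toy student-by-exercise score matrix (synthetic, for illustration only).
X = np.random.default_rng(1).random((6, 5))
W, H, losses = nmf_projected_gradient(X, rank=2)
print(losses[0], losses[-1])
```

Clipping at zero with `np.maximum` is the projection onto the nonnegative orthant, and the `1/L` step size is what makes each block update a guaranteed descent step. In a student-modeling reading, the rows of `W` would play the role of per-student latent factors, though interpreting them as knowledge proficiency requires the additional structure the paper introduces.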